The XML file produced by SDAtoXML is a valid DDI file, but it has only minimal study-level information. Besides the variable definitions, the file contains only the study title, the number of cases, and (if a DDL file is referenced) the dimensions of the original ASCII data file. The current date is also stored as the date on which the DDI file was produced.
These files are used to indicate which variables to document, and in which order. If neither of the following files is specified, the program will output variable definitions for all variables in the SDA dataset, in alphabetical order (except that CASEID will be the first variable defined).
If a DDL file is specified, some study-level information is taken from the DDL file and output to the XML file. This includes the dimensions of the original ASCII data file, as given in the ‘reclen=’ and ‘records/case=’ specifications.
All of the specifications for each variable, however, are taken only from the SDA dataset, not from the DDL file. Only the names of the variables, given after the ‘name=’ specifications in the DDL file, are used by the program (as a variable list).
Variable names may be listed one per line or several per line, separated by spaces, tabs, or commas. Blank lines in this file are ignored, as is everything to the right of a pound sign (#).
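The parsing rules above can be sketched in a few lines of Python. This is only an illustration of the file format as described, not the code SDAtoXML itself uses; the function name parse_variable_list and the sample variable names are hypothetical.

```python
import re

def parse_variable_list(text):
    """Return the variable names in the order they appear in the text."""
    names = []
    for line in text.splitlines():
        # Ignore everything to the right of a pound sign.
        line = line.split("#", 1)[0]
        # Names may be separated by spaces, tabs, or commas.
        # Blank lines simply contribute no names.
        names.extend(tok for tok in re.split(r"[ \t,]+", line) if tok)
    return names

sample = """\
# demographic variables
age, sex  race
educ\tincome   # trailing comment

vote92
"""

print(parse_variable_list(sample))
# ['age', 'sex', 'race', 'educ', 'income', 'vote92']
```

The order of the returned list matters, since the variable list also determines the order in which variables are documented.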
If a variable has more than the specified number of categories, no frequencies or percentages for individual categories are output. However, any category labels that have been defined for a variable will always be output, regardless of the number of categories.
If category percentages are output, they are calculated on the basis of ALL cases, both valid and invalid. The system-missing category, if present, will be output with a ‘.’ (period or dot) as the category value.
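As a small worked illustration of the percentage base, the following Python sketch divides each category count by the total of ALL cases, including invalid and system-missing ones. The counts, the use of 9 as an invalid code, and the function name are hypothetical.

```python
def category_percentages(counts):
    """Percentages based on all cases, valid and invalid alike."""
    total = sum(counts.values())
    if total == 0:
        return {value: 0.0 for value in counts}
    return {value: 100.0 * n / total for value, n in counts.items()}

# Hypothetical frequencies: '9' is an invalid (missing-data) code and
# '.' stands for the system-missing category.
counts = {"1": 50, "2": 30, "9": 15, ".": 5}

for value, pct in category_percentages(counts).items():
    print(value, pct)
```

With these counts the percentages for the valid categories 1 and 2 are 50.0 and 30.0, not 62.5 and 37.5, because the 20 invalid cases remain in the denominator.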
The statistics generated are the following: mean, median, mode, standard deviation, number of valid cases, number of invalid cases, minimum valid category value, and maximum valid category value. If a variable has a very large number of distinct categories, the median and mode may not be computed, but the other statistics will be output for all variables.
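For readers who want to check these statistics by hand, here is a minimal Python sketch using the standard statistics module. It is an assumption-laden illustration: the documentation does not say whether the standard deviation is the sample or the population version (the sample version is used here), and the function name, data, and invalid code are hypothetical.

```python
import statistics

def summarize(values, invalid_codes):
    """Compute the listed summary statistics over the valid cases only."""
    # None stands in for system-missing data in this sketch.
    valid = [v for v in values if v is not None and v not in invalid_codes]
    return {
        "mean": statistics.mean(valid),
        "median": statistics.median(valid),
        "mode": statistics.mode(valid),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "stdev": statistics.stdev(valid),
        "n_valid": len(valid),
        "n_invalid": len(values) - len(valid),
        "min": min(valid),
        "max": max(valid),
    }

data = [1, 2, 2, 3, 9, None, 2, 4]
stats = summarize(data, invalid_codes={9})
print(stats["n_valid"], stats["n_invalid"], stats["min"], stats["max"])
# 6 2 1 4
```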
The ‘-n’ option allows the user to define what is meant by ‘shorter’ or ‘longer’ labels. If the length of the category label in SDA is less than or equal to the specified limit (default=60), the category label (if any) will be output using the ‘labl’ element. If the label is longer than the specified limit, it will be output using the ‘txt’ element.
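The length rule can be expressed as a one-line test. The sketch below assumes that label length is counted in characters; the function name label_element is hypothetical and is not part of SDAtoXML.

```python
def label_element(label, limit=60):
    """Pick the DDI element for a category label under the length rule."""
    # At or below the limit: 'labl'. Above the limit: 'txt'.
    return "labl" if len(label) <= limit else "txt"

print(label_element("Strongly agree"))   # labl
print(label_element("x" * 61))           # txt
print(label_element("x" * 61, limit=100))  # labl
```

Raising the limit therefore moves more labels into the ‘labl’ element, and lowering it moves more into ‘txt’.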
In SDA it is possible to define category labels that have both a long and a short version. The long category label would generally be used in a codebook, whereas the short version would be used in a table. In defining such labels for SDA, the short version of a category label is put between square brackets. An example of such a label would be:
Definitely will vote in the next election [Definitely vote]
When converting such labels into XML definitions, the short label will be put out using the ‘labl’ element, and the long label will be put out using the ‘txt’ element.
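Splitting a combined label into its long and short parts might look like the following Python sketch. It assumes the bracketed short version comes at the end of the label, as in the example; how SDA treats brackets elsewhere in a label is not covered here, and the function name split_label is hypothetical.

```python
import re

# Long text followed by a short version in square brackets at the end.
_BRACKETED = re.compile(r"^(.*?)\s*\[([^\]]*)\]\s*$")

def split_label(label):
    """Return {'txt': long, 'labl': short}, or None if there is no short version."""
    match = _BRACKETED.match(label)
    if match is None:
        return None
    return {"txt": match.group(1), "labl": match.group(2)}

print(split_label("Definitely will vote in the next election [Definitely vote]"))
# {'txt': 'Definitely will vote in the next election', 'labl': 'Definitely vote'}
```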
If the variable names have been changed to upper- or lower-case letters (by the ‘-c’ or ‘-l’ options), those changes will be reflected in the names of the variables written to this file.
sdatoxml -s /mysda -o myddi.xml
sdatoxml -s . -o myddi.xml -c -m 100
DDI | Data Documentation Initiative - Version 2 |
DDL | Data Description Language used for some SDA Programs |
xconvert | Convert SAS, SPSS, or Stata definitions into XML (DDI) |